A general lower bound is developed for the minimax risk when 
estimating an arbitrary functional. The bound is based on testing 
two composite hypotheses and is shown to be effective in estimating 
the nonsmooth functional $\frac{1}{n}\sum_{i=1}^n |\theta_i|$ from an observation $Y \sim N(\theta, I_n)$. 
This problem exhibits some features that are significantly different 
from those that occur in estimating conventional smooth functionals. 
This is a setting where standard techniques fail to yield sharp results. 

A sharp minimax lower bound is established by applying the gen- 
eral lower bound technique based on testing two composite hypothe- 
ses. A key step is the construction of two special priors and bounding 
the chi-square distance between two normal mixtures. An estimator 
is constructed using approximation theory and Hermite polynomials 
and is shown to be asymptotically sharp minimax when the means are 
bounded by a given value $M$. It is shown that the minimax risk equals 
$\beta_*^2 M^2 (\log\log n / \log n)^2$ asymptotically, where $\beta_*$ is the Bernstein constant. 

The general techniques and results developed in the present paper 
can also be used to solve other related problems. 



1. Introduction. Minimax risk is one of the most commonly used bench- 
marks for evaluating the performance of any estimation method. For this 
reason considerable effort has been made developing minimax theories in 
the nonparametric function estimation literature. A key step in all these 
developments is the derivation of minimax lower bounds. Several effective 
lower bound techniques based on testing have been introduced in the litera- 
ture, and it is often sufficient to derive the optimal rate of convergence based 
on testing a pair of simple hypotheses. Le Cam's method is a well-known 
approach based on this idea. See, for example, Le Cam (1973) and Donoho 
and Liu (1991). 

For estimation of quadratic functionals the story is somewhat more 
complicated. If the parameter space is not too "large," the regular parametric rate 
of convergence can be attained. However, Bickel and Ritov (1988) showed 
that when the parameter space is too large, the essential difficulty of such 
problems cannot be captured by testing a simple null versus a simple alter- 
native. Instead rate optimal lower bounds can often be provided by testing 
a simple null versus a composite alternative where the value of the func- 
tional is constant on the composite alternative. See, for example, Cai and 
Low (2005), where upper and lower bounds are constructed for quadratic 
functionals over many different parameter spaces. 

Recently some nonsmooth functionals have been considered. A partic- 
ularly interesting paper is Lepski, Nemirovski and Spokoiny (1999) which 
studies the problem of estimating the $L_r$ norm of the drift function under 
the white noise model. One of the key observations in this paper is the need 
to consider testing between two composite hypotheses where the $L_r$ norm is 
not constant on either of these composite hypotheses and where the sets of 
values of the functional on these two hypotheses are interwoven. These are 
called fuzzy hypotheses in the language of Tsybakov (2009). 

The purpose of the present paper is to advance these ideas further. We 
first develop a new general minimax lower bound technique for estimating 
any functional $T$ based on testing two composite hypotheses. For any two 
priors, say $\mu_0$ and $\mu_1$, on the parameter space we obtain a lower bound 
on the expected squared bias with respect to $\mu_1$ under a constraint on the 
upper bound of the expected mean squared error with respect to $\mu_0$. The 
lower bound depends on the difference between the expected value of $T$ over 
each of the priors and also on the variance of $T$ under $\mu_0$. The bound also 
depends on the chi-square distance between the two marginal distributions 
of the observations, one over $\mu_0$, the other over $\mu_1$. Some of the technical 
tools for deriving minimax lower bounds developed earlier in the literature 
can be seen as special cases of the general result given in the present paper. 

We then consider specifically the problem of estimating the $\ell_1$ norm of 
a multivariate normal mean vector. This nonsmooth functional estimation 
problem exhibits some features that are significantly different from those in 
estimating smooth functionals in terms of the optimal rates of convergence 
as well as the technical tools needed for the analysis of both the minimax 
lower bounds and the construction of the optimal estimators. 

Let $y_1, y_2, \ldots, y_n$ be independent normal random variables where $y_i \sim N(\theta_i, 1)$. 
The problem of focus in this paper is that of estimating 

(1)  $T(\theta) = \frac{1}{n}\sum_{i=1}^n |\theta_i|$, 

where we assume that either $|\theta_i| \le M$ for some constant $M > 0$ or that 
there are no constraints on the $\theta_i$. In the present paper we develop optimal 
estimators of $T(\theta)$ along with minimax lower bounds. In particular for the 
bounded case we construct an asymptotically sharp minimax estimator using 
approximation theory and Hermite polynomials. By combining the minimax 
lower and upper bounds developed in later sections, the main results on 
the minimax estimation of the functional $T(\theta)$ can be summarized in the 
following theorem. 

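Theorem 1. Let $Y \sim N(\theta, I_n)$ and let $T(\theta) = \frac{1}{n}\sum_{i=1}^n |\theta_i|$. For a fixed 
constant $M > 0$, denote by $\Theta_n(M) = \{\theta \in \mathbb{R}^n : |\theta_i| \le M\}$. Then the minimax 
risk for estimating the functional $T(\theta)$ based on $Y$ over $\Theta_n(M)$ satisfies 

(2)  $\inf_{\widehat{T}} \sup_{\theta \in \Theta_n(M)} E(\widehat{T} - T(\theta))^2 = \beta_*^2 M^2 \left(\frac{\log\log n}{\log n}\right)^2 (1 + o(1))$, 

where $\beta_* \approx 0.28017$ is the Bernstein constant, and the minimax risk for 
estimating the functional $T(\theta)$ over $\mathbb{R}^n$ satisfies 

(3)  $\inf_{\widehat{T}} \sup_{\theta \in \mathbb{R}^n} E(\widehat{T} - T(\theta))^2 \asymp \frac{1}{\log n}$. 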

These rates are dramatically different from the usual parametric or 
algebraic rates of convergence for estimating smooth functionals. The 
fundamental difficulty of estimating the functional $T(\theta)$ can be traced back to 
the nondifferentiability of the absolute value function at the origin. This is 
reflected both in the derivation of the lower bounds and the construction of 
the optimal estimators. Best polynomial approximation and Hermite poly- 
nomials play major roles in the derivation of the lower bounds as well as in 
the construction of the optimal estimators. 
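
To get a feel for how slow these rates are, the following sketch evaluates the leading term in (2), the $1/\log n$ rate in (3) and the parametric rate $1/n$ for a few sample sizes. The function names and the choice $M = 1$ are illustrative assumptions, and the numbers are leading-order values only: the $(1 + o(1))$ factor and the unspecified constant in (3) are ignored.

```python
import math

BETA_STAR = 0.280169499  # numerical value of the Bernstein constant

def bounded_risk(n, M=1.0):
    """Leading term beta_*^2 M^2 (log log n / log n)^2 of the risk in (2)."""
    return (BETA_STAR * M * math.log(math.log(n)) / math.log(n)) ** 2

def unbounded_rate(n):
    """The rate 1 / log n in (3), up to a constant factor."""
    return 1.0 / math.log(n)

for n in (10**3, 10**6, 10**9):
    print(f"n={n:>10d}  bounded={bounded_risk(n):.5f}  "
          f"unbounded~{unbounded_rate(n):.5f}  parametric={1.0 / n:.1e}")
```

Even at $n = 10^9$ the logarithmic rates remain many orders of magnitude above $1/n$.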

The minimax lower bounds are established by applying the general lower 
bound technique to two carefully constructed composite hypotheses. In the 
present context to obtain good lower bounds, neither prior can be degen- 
erate. A key step is the construction of two mixture priors which have a 
large difference in the expected values of the functional while making the 
chi-square distance between the two mixture models small. In order to turn 
this heuristic idea into an effective tool it is necessary to be able to bound 
the chi-square distance between two normal mixture models. In previous ap- 
plications such bounds have only been given in the much simpler case when 
one of the mixtures is degenerate. See, for example, Cai and Low (2005) and 
Wang et al. (2008). 

The construction of the optimal estimators of the nonsmooth functional 
$T(\theta)$ is significantly more complicated than those for linear or quadratic 
functionals. For optimal estimation of $T(\theta)$ over the bounded set $\Theta_n(M)$, 
we first use the best polynomial approximation $G_K^*(x) = \sum_{k=0}^K g_{2k}^* x^{2k}$ of the 
absolute value function $|x|$. Then for each $i$ and each $k$ we form an unbiased 
estimate of $\theta_i^{2k}$ using the Hermite polynomials. Putting these terms together 
for a given $i$ yields an estimate of $|\theta_i|$. An effective estimate of the functional 
$T$ can then be constructed by averaging these estimates of $|\theta_i|$. We show that 
by carefully selecting the cutoff $K = \log n / (2\log\log n)$ the resulting estimator is 
asymptotically sharp minimax. This estimator is, however, not optimal over the 
unbounded parameter space $\mathbb{R}^n$. An additional testing step is used to 
construct a hybrid estimator and it is shown that the estimator is rate optimal 
for estimating $T(\theta)$ over $\mathbb{R}^n$. In addition, we also consider the estimation of 
$T(\theta)$ over a parameter space where the mean is a high-dimensional sparse 
vector with a small fraction of nonzero coordinates. 

The rest of the paper is organized as follows. In Section 2 we derive 
the general lower bounds for estimating any functional T based on testing 
two composite hypotheses. In Section 3 we bound the chi-square distance 
between two normal mixture models and apply the general lower bound from 
Section 2 to derive minimax lower bounds for estimating the nonsmooth 
functional $T(\theta)$ given in (1). Section 4 constructs an estimator of $T(\theta)$ using 
best polynomial approximation and Hermite polynomials and shows that the 
estimator is sharp minimax for the bounded case. Section 5 considers the 
unbounded case. A hybrid estimator is constructed and is shown to attain 
the optimal rate of convergence. Section 6 treats the sparse case. A discussion 
of the connections and differences between our results and other related work is 
given in Section 7. Technical lemmas and some of the main results are proved 
in Section 8. 

2. General lower bound. In this section, a constrained risk inequality is 
developed which immediately yields a general minimax lower bound based 
on testing two composite hypotheses. 

Suppose we observe a random variable $X$ which has a distribution $P_\theta$ 
where $\theta$ belongs to a given parameter space $\Theta$. Let $\widehat{T} = \widehat{T}(X)$ be an estimator 
of a function $T(\theta)$ based on $X$ and denote the bias of $\widehat{T}$ by $B(\theta) = E_\theta \widehat{T} - T(\theta)$. 
Let $\Theta_0$ and $\Theta_1$ be subsets of the parameter space $\Theta$ where $\Theta_0 \cup \Theta_1 = \Theta$. 
Let $\mu_0$ and $\mu_1$ be two prior distributions supported on $\Theta_0$ and $\Theta_1$, 
respectively. 

Let $m_i$ and $v_i^2$ be the means and variances of $T(\theta)$ under the priors $\mu_i$ for 
$i = 0$ and 1. More specifically, 

$m_i = \int T(\theta)\mu_i(d\theta)$  and  $v_i^2 = \int (T(\theta) - m_i)^2 \mu_i(d\theta)$. 

Write $F_i$ for the marginal distribution of $X$ when the prior is $\mu_i$ for $i = 0, 1$. 
Let $f_i$ be the density of $F_i$ with respect to a common dominating measure 
of $F_0$ and $F_1$. For any function $g$ we shall write $E_{f_0} g(X)$ for the expectation 
of $g(X)$ with respect to the marginal distribution of $X$ when the prior on $\theta$ 
is $\mu_0$. We shall write $E_\theta g(X)$ for the expectation of $g(X)$ under $P_\theta$. 
Finally define the chi-square distance between $f_0$ and $f_1$ by 

$I = \left( E_{f_0}\left( \frac{f_1(X)}{f_0(X)} - 1 \right)^2 \right)^{1/2}$. 

The following theorem gives a lower bound for the average risk of an 
estimator $\widehat{T}$ under any mixture prior $\lambda\mu_0 + (1 - \lambda)\mu_1$, $0 \le \lambda \le 1$. 

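Theorem 2. 
(i) Suppose $\int E_\theta(\widehat{T}(X) - T(\theta))^2 \mu_0(d\theta) \le \varepsilon^2$, then 

(4)  $\left| \int B(\theta)\mu_1(d\theta) - \int B(\theta)\mu_0(d\theta) \right| \ge |m_1 - m_0| - (\varepsilon + v_0) I$. 

(ii) If $|m_1 - m_0| > v_0 I$ and $0 \le \lambda \le 1$, then 

(5)  $\int E_\theta(\widehat{T}(X) - T(\theta))^2 (\lambda\mu_0(d\theta) + (1 - \lambda)\mu_1(d\theta)) \ge \frac{\lambda(1 - \lambda)(|m_1 - m_0| - v_0 I)^2}{\lambda + (1 - \lambda)(I + 1)^2}$ 

and in particular 

(6)  $\max_{i=0,1} \int E_\theta(\widehat{T}(X) - T(\theta))^2 \mu_i(d\theta) \ge \frac{(|m_1 - m_0| - v_0 I)^2}{(I + 2)^2}$. 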


Informally, Theorem 2 says that if the average risk of $\widehat{T}$ under $\mu_0$ is 
"small," then the change in average bias under $\mu_0$ and under $\mu_1$ must be 
"large." In particular, this implies that the average risk under a mixture 
prior is "large." 

Since the maximum risk is always at least as large as the average risk, 
Theorem 2 yields immediately a lower bound on the minimax risk. 

Corollary 1. If $|m_1 - m_0| > v_0 I$, then 

(7)  $\sup_{\theta \in \Theta} E_\theta(\widehat{T}(X) - T(\theta))^2 \ge \frac{(|m_1 - m_0| - v_0 I)^2}{(I + 2)^2}$. 

Simpler versions of constrained risk inequalities have been developed be- 
fore, most often for studying the cost of adaptation and super efficiency. For 
example, a two-point risk inequality was given in Brown and Low (1996) and 
used to study adaptive estimation of linear functionals. The constrained risk 
inequality given in the present paper allows for a richer collection of 
applications and is especially useful when estimating nonsmooth functionals where 
it is essential to test complicated composite hypotheses in order to obtain 
good minimax lower bounds. In particular the lower bounds given in the 
next section rely on Corollary 1. 

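Proof of Theorem 2. We shall assume without loss of generality 
that $m_1 \ge m_0$. Then 

$E_{f_0}\left\{ (\widehat{T}(X) - m_0) \frac{f_1(X) - f_0(X)}{f_0(X)} \right\} = m_1 + \int B(\theta)\mu_1(d\theta) - \left( m_0 + \int B(\theta)\mu_0(d\theta) \right)$. 

Now note that 

$E_{f_0}(\widehat{T}(X) - m_0)^2 = \int E_\theta(\widehat{T}(X) - m_0)^2 \mu_0(d\theta)$ 
$\quad = \int E_\theta(\widehat{T}(X) - T(\theta) + T(\theta) - m_0)^2 \mu_0(d\theta)$ 
$\quad = \int E_\theta(\widehat{T}(X) - T(\theta))^2 \mu_0(d\theta) + \int (T(\theta) - m_0)^2 \mu_0(d\theta) + 2\int B(\theta)(T(\theta) - m_0)\mu_0(d\theta)$ 
$\quad \le \varepsilon^2 + v_0^2 + 2 v_0 \varepsilon = (\varepsilon + v_0)^2$. 

The Cauchy-Schwarz inequality now yields 

$E_{f_0}\left\{ (\widehat{T}(X) - m_0) \frac{f_1(X) - f_0(X)}{f_0(X)} \right\} \le \left( E_{f_0}(\widehat{T}(X) - m_0)^2 \right)^{1/2} \cdot I \le (\varepsilon + v_0) I$. 

Hence, 

(8)  $m_1 + \int B(\theta)\mu_1(d\theta) - \left( m_0 + \int B(\theta)\mu_0(d\theta) \right) \le (\varepsilon + v_0) I$, 

and it follows that 

$\int B(\theta)\mu_1(d\theta) - \int B(\theta)\mu_0(d\theta) \le m_0 - m_1 + (\varepsilon + v_0) I$, 
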
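which in turn yields (4). 

Now consider the quadratic 

(9)  $J(x) = \lambda x^2 + (1 - \lambda)(a - bx)^2$, 

where we assume that $0 < \lambda < 1$, $a > 0$ and $b > 0$. It is easy to check that 
$J$ is minimized when $x = x_{\min} = \frac{ab(1 - \lambda)}{\lambda + b^2(1 - \lambda)}$, that at this value $a - bx \ge 0$ 
and $J(x_{\min}) = \frac{a^2 \lambda(1 - \lambda)}{\lambda + b^2(1 - \lambda)}$. It follows that 

(10)  $\lambda x^2 + (1 - \lambda)(\max(a - bx, 0))^2$ 

is also minimized at this same value. Now, since $|\int B(\theta)\mu_0(d\theta)| \le \varepsilon$, it follows from (4) that we also have 

$\int B^2(\theta)\mu_1(d\theta) \ge (\max(m_1 - m_0 - v_0 I - (I + 1)\varepsilon, 0))^2$. 

It follows that for $0 \le \lambda \le 1$ 

$\lambda\varepsilon^2 + (1 - \lambda)\int B^2(\theta)\mu_1(d\theta) \ge \lambda\varepsilon^2 + (1 - \lambda)(\max(m_1 - m_0 - v_0 I - (I + 1)\varepsilon, 0))^2$ 
$\quad \ge \frac{\lambda(1 - \lambda)(|m_1 - m_0| - v_0 I)^2}{\lambda + (1 - \lambda)(I + 1)^2}$, 

which gives (5). The final inequality (6) follows by setting $\lambda = (I + 1)/(I + 2)$, since the 
minimax risk is greater than any Bayes risk. □ 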
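
The algebra at the end of the proof can be checked numerically. The sketch below uses arbitrary illustrative values of $a = |m_1 - m_0| - v_0 I$ and $I$ (assumptions, not values from the paper). It confirms the closed-form minimiser of the quadratic $J$ in (9), and that the right-hand side of (5), viewed as a function of $\lambda$, peaks at $\lambda = (I + 1)/(I + 2)$ with value $a^2/(I + 2)^2$, which is the bound in (6).

```python
import numpy as np

def mixture_bound(a, I, lam):
    """Right-hand side of (5), with a = |m1 - m0| - v0 * I."""
    b = I + 1.0
    return lam * (1.0 - lam) * a**2 / (lam + (1.0 - lam) * b**2)

def J(x, a, b, lam):
    """The quadratic in (9)."""
    return lam * x**2 + (1.0 - lam) * (a - b * x) ** 2

a, I = 1.3, 0.4                      # illustrative values only
b = I + 1.0
lam_star = (I + 1.0) / (I + 2.0)     # weight that turns (5) into (6)

# Closed-form minimiser of J compared with a brute-force grid search.
x_min = a * b * (1.0 - lam_star) / (lam_star + b**2 * (1.0 - lam_star))
grid = np.linspace(-5.0, 5.0, 200001)
grid_min = float(J(grid, a, b, lam_star).min())
print("closed form:", J(x_min, a, b, lam_star), " grid:", grid_min)

# The bound in (5) at lam_star, and its maximum over a grid of weights.
lams = np.linspace(0.001, 0.999, 999)
print("bound at lam_star:", mixture_bound(a, I, lam_star))
print("max over lambda  :", float(mixture_bound(a, I, lams).max()))
print("a^2 / (I + 2)^2  :", a**2 / (I + 2.0) ** 2)
```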

3. Lower bound for estimating the $\ell_1$ norm of normal means. We now 
turn to the problem of optimally estimating a particular nonsmooth 
functional where the use of the lower bound developed in the previous section 
yields sharp results. Let $y_i \stackrel{\mathrm{ind}}{\sim} N(\theta_i, 1)$, $i = 1, 2, \ldots, n$, and consider the 
functional $T$ where 

(11)  $T(\theta) = \frac{1}{n}\sum_{i=1}^n |\theta_i|$. 

As mentioned in the Introduction, there are two particularly interesting 
cases. One is the bounded case with $\theta \in \Theta_n(M)$ where $\Theta_n(M) = \{\theta \in \mathbb{R}^n : |\theta_i| \le M\}$ 
with a constant $M > 0$. Another case is the unbounded case where 
$\theta \in \mathbb{R}^n$. It is worth noting that we need to consider the bounded case with 
a bound growing in $n$ in order to solve the unbounded case. In addition, we 
are also interested in the sparse case where $\theta$ is a high-dimensional sparse 
vector with a small fraction of nonzero coordinates. 

In this section the focus is on developing minimax lower bounds. The 
minimax upper bounds and the optimal estimation procedures will be given 
in the next three sections. Best polynomial approximation plays a major 
role in the development of the lower bound and as we shall see later also in 
the development of the upper bound. 

3.1. Best polynomial approximation of the absolute value function. 
Optimal polynomial approximation of the absolute value function has been well 
studied in approximation theory. See, for example, Bernstein (1913), Varga 
and Carpenter (1987) and Rivlin (1990). For a given positive integer $k$, let 
$\mathcal{P}_k$ denote the class of all real polynomials of degree at most $k$. For any 
continuous function $f$ on $[-1, 1]$, let 

$\delta_k(f) = \inf_{G \in \mathcal{P}_k} \max_{x \in [-1,1]} |f(x) - G(x)|$. 

A polynomial $G^*$ is said to be a best polynomial approximation of $f$ if 

$\delta_k(f) = \max_{x \in [-1,1]} |f(x) - G^*(x)|$. 

We now focus on the special case of the absolute value function $f(x) = |x|$. 
Because $f$ is an even function, so is its best polynomial approximation. 
We thus only need to consider polynomials of even degrees. For any positive 
integer $k$, we shall denote by $G_k^*$ the best polynomial approximation of degree 
$2k$ to $|x|$ and write 

(12)  $G_k^*(x) = \sum_{j=0}^k g_{2j}^* x^{2j}$. 

The Bernstein constant is defined as 

$\beta_* = \lim_{k \to \infty} 2k\,\delta_{2k}(f)$. 

Bernstein (1913) showed that the limit exists and is between 0.278 and 
0.286. Varga and Carpenter (1987) disproved a conjecture by Bernstein and 
calculated that $\beta_* = 0.280169499$. 

The classical Chebyshev alternation theorem states that a polynomial 
$G^* \in \mathcal{P}_k$ is the (unique) best polynomial approximation to a continuous 
function $f$ if and only if the difference $f(x) - G^*(x)$ takes consecutively its 
maximal value with alternating signs at least $k + 2$ times. That is, there 
exist $k + 2$ points $-1 \le x_0 < \cdots < x_{k+1} \le 1$ such that 

$f(x_j) - G^*(x_j) = \pm(-1)^j \max_{x \in [-1,1]} |f(x) - G^*(x)|$,  $j = 0, \ldots, k + 1$. 

In the case of the absolute value function, the best polynomial approximation 
$G_k^*(x)$ has at least $2k + 2$ alternation points. The set of these alternation 
points is important in the construction of the least favorable priors used in 
the derivation of the minimax lower bounds given in this section. Divide the 
set of the alternation points of $G_k^*(x)$ into two subsets and denote 

(13)  $A_0 = \{x \in [-1, 1] : |x| - G_k^*(x) = -\delta_{2k}(|x|)\}$, 

(14)  $A_1 = \{x \in [-1, 1] : |x| - G_k^*(x) = \delta_{2k}(|x|)\}$. 

It follows easily from the fact that both $|x|$ and $G_k^*(x)$ are even functions that 
the set $A_0$ contains an odd number of points and $A_1$ has an even number of 
points. We shall see later that least favorable priors are necessarily supported 
on $A_0$ and $A_1$, respectively. Intuitively, this makes the priors maximally 
apart and yet not "testable." It also connects the construction of the optimal 
estimator with the minimax lower bound. 



3.2. Minimax lower bounds. We now state and prove the minimax lower 
bounds for estimating the nonsmooth functional $T(\theta)$ over the bounded set 
$\Theta_n(M)$ and the unbounded set $\mathbb{R}^n$. The derivation of the lower bounds relies 
heavily on the general lower bound argument given in the previous section. 
It also requires a careful construction of least favorable prior distributions 
$\mu_0$ and $\mu_1$ along with finding an effective upper bound for the chi-square 
distance between the marginal distributions. 

Theorem 3. Let $y_i \sim N(\theta_i, 1)$, $i = 1, \ldots, n$, be independent normal 
random variables, and let $T(\theta) = \frac{1}{n}\sum_{i=1}^n |\theta_i|$. For a fixed constant $M > 0$, denote 
by $\Theta_n(M) = \{\theta \in \mathbb{R}^n : |\theta_i| \le M\}$. Then, the minimax risk for estimating $T(\theta)$ 
over the parameter space $\Theta_n(M)$ is bounded from below as 

(15)  $\inf_{\widehat{T}} \sup_{\theta \in \Theta_n(M)} E(\widehat{T} - T(\theta))^2 \ge \beta_*^2 M^2 \left(\frac{\log\log n}{\log n}\right)^2 (1 + o(1))$, 

where $\beta_*$ is the Bernstein constant. Without any constraint on the 
parameters, the minimax risk satisfies 

(16)  $\inf_{\widehat{T}} \sup_{\theta \in \mathbb{R}^n} E(\widehat{T} - T(\theta))^2 \ge \frac{\beta_*^2}{9e^2 \log n}(1 + o(1))$. 

The minimax lower bounds given in Theorem 3 converge to zero at a slow 
logarithmic rate showing that the nonsmooth functional $T(\theta)$ is difficult to 
estimate. In contrast the rates for estimating linear and quadratic functionals 
are most often algebraic. In particular let 

$L(\theta) = \frac{1}{n}\sum_{i=1}^n \theta_i$  and  $Q(\theta) = \frac{1}{n}\sum_{i=1}^n \theta_i^2$. 

It is easy to check that the usual parametric rate of convergence over $\mathbb{R}^n$ 
for estimating the linear functional $L(\theta)$ can be attained by the sample 
average $\bar{y}$. For estimating the quadratic functional $Q(\theta)$, the parametric 
rate can be achieved over $\Theta_n(M)$ by using the unbiased estimator 
$\widehat{Q} = \frac{1}{n}\sum_{i=1}^n (y_i^2 - 1)$. 

We shall show in the next section that the minimax lower bound 
$\beta_*^2 M^2 (\log\log n / \log n)^2$ for $\Theta_n(M)$ is in fact asymptotically sharp and the rate of 
convergence $\frac{1}{\log n}$ for $\mathbb{R}^n$ is optimal. The optimal procedures are constructed 
using the Hermite polynomials. These procedures are much more involved 
than those for estimating the linear and quadratic functionals discussed 
above. 

A crucial tool in the proof of the lower bounds as well as in the 
construction of the optimal procedures is the application of properties of Hermite 
polynomials. Let $\phi$ denote the standard normal density and let $H_k$ be the 
Hermite polynomial defined by 

(17)  $\frac{d^k}{dy^k}\phi(y) = (-1)^k H_k(y)\phi(y)$. 

For this version of the Hermite polynomial 

(18)  $\int H_k^2(y)\phi(y)\,dy = k!$  and  $\int H_k(y)H_j(y)\phi(y)\,dy = 0$, 

when $k \ne j$. 
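
The normalisation in (17) is that of the probabilists' Hermite polynomials, which NumPy exposes in `numpy.polynomial.hermite_e`. The sketch below checks the two identities in (18) by Gauss-Hermite quadrature and also checks the Taylor-type expansion $\phi(y - t) = \sum_k H_k(y)\phi(y)t^k/k!$ at one point. The helper names and the particular values of $y$ and $t$ are illustrative assumptions.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite(k, y):
    """H_k(y) in the probabilists' convention (leading coefficient 1)."""
    return He.hermeval(y, [0.0] * k + [1.0])

def phi(y):
    """Standard normal density."""
    return np.exp(-0.5 * np.asarray(y) ** 2) / math.sqrt(2.0 * math.pi)

# Gauss-Hermite rule for the weight exp(-y^2 / 2); dividing the weights by
# sqrt(2 * pi) turns the rule into an expectation under N(0, 1).
nodes, weights = He.hermegauss(30)
weights = weights / math.sqrt(2.0 * math.pi)

def inner(j, k):
    """Integral of H_j(y) H_k(y) phi(y) dy, exact up to rounding here."""
    return float(np.sum(weights * hermite(j, nodes) * hermite(k, nodes)))

print(inner(4, 4), math.factorial(4))   # squared norm versus k!
print(inner(3, 5))                      # orthogonality for k != j

# Truncated series for phi(y - t) at a single point.
y, t = 0.8, 0.5
series = sum(hermite(k, y) * phi(y) * t**k / math.factorial(k) for k in range(30))
print(float(series), float(phi(y - t)))
```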

Another key technical tool for the proof of Theorem 3 is the construction 
of two priors with special properties. 

Lemma 1. For any given even integer $k > 0$, there exist two probability 
measures $\nu_0$ and $\nu_1$ on $[-1, 1]$ that satisfy the following conditions: 

• $\nu_0$ and $\nu_1$ are symmetric around 0; 

• $\int t^l \nu_1(dt) = \int t^l \nu_0(dt)$, for $l = 0, 1, \ldots, k$; 

• $\int |t| \nu_1(dt) - \int |t| \nu_0(dt) = 2\delta_k$, 

where $\delta_k$ is the distance in the uniform norm on $[-1, 1]$ from the absolute 
value function $f(x) = |x|$ to the space $\mathcal{P}_k$ of polynomials of no more than 
degree $k$. 

As discussed earlier, $\delta_k = \beta_* k^{-1}(1 + o(1))$ as $k \to \infty$, where $\beta_*$ is the 
Bernstein constant. See Section 7 for further discussions. The proof of Lemma 1 
is given in Section 8. 

Proof of Theorem 3. For a given even integer $k_n$, let $\nu_0$ and $\nu_1$ be 
two probability measures possessing the properties given in Lemma 1. Let 
$g(x) = Mx$ and let $\mu_i$ be the measures on $[-M, M]$ defined by $\mu_i(A) = \nu_i(g^{-1}(A))$ 
for $i = 0$ and 1. It follows that: 

• $\mu_0$ and $\mu_1$ are symmetric around 0; 

• $\int t^l \mu_1(dt) = \int t^l \mu_0(dt)$, for $l = 0, 1, \ldots, k_n$; 

• $\int |t| \mu_1(dt) - \int |t| \mu_0(dt) = 2M\delta_{k_n}$. 

Let $\mu_1^n$ and $\mu_0^n$ be the product priors $\mu_i^n = \prod_{j=1}^n \mu_i$. In other words, we 
put down $n$ independent priors on the coordinates. We have 

$E_{\mu_1^n} T(\theta) - E_{\mu_0^n} T(\theta) = 2M\delta_{k_n}$ 

and 

$E_{\mu_0^n}\left(T(\theta) - E_{\mu_0^n} T(\theta)\right)^2 \le \frac{M^2}{n}$. 

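Set $f_{0,M}(y) = \int \phi(y - t)\mu_0(dt)$ and $f_{1,M}(y) = \int \phi(y - t)\mu_1(dt)$. Note that 
since $g(x) = \exp(-x)$ is a convex function of $x$, and $\mu_0$ is symmetric, 

$f_{0,M}(y) \ge \frac{1}{\sqrt{2\pi}} \exp\left(-\int \frac{(y - t)^2}{2}\mu_0(dt)\right) = \phi(y)\exp\left(-\frac{1}{2}\int t^2 \mu_0(dt)\right) \ge \phi(y)\exp\left(-\frac{1}{2}M^2\right)$. 

Let $H_k$ be the Hermite polynomial defined in (17). Then 

$\phi(y - t) = \sum_{k=0}^\infty H_k(y)\phi(y)\frac{t^k}{k!}$, 

and it follows that 

$\int \frac{(f_{1,M}(y) - f_{0,M}(y))^2}{f_{0,M}(y)}\,dy \le e^{M^2/2} \sum_{k=k_n+1}^\infty \frac{1}{k!} M^{2k}$. 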



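Now set 

$I_n^2 = \int \frac{\left(\prod_{i=1}^n f_{1,M}(y_i) - \prod_{i=1}^n f_{0,M}(y_i)\right)^2}{\prod_{i=1}^n f_{0,M}(y_i)}\,dy_1\,dy_2 \cdots dy_n$. 

Then 

(19)  $I_n^2 = \int \frac{\left(\prod_{i=1}^n f_{1,M}(y_i)\right)^2}{\prod_{i=1}^n f_{0,M}(y_i)}\,dy_1\,dy_2 \cdots dy_n - 1 = \prod_{i=1}^n \int \frac{(f_{1,M}(y_i))^2}{f_{0,M}(y_i)}\,dy_i - 1 \le \left(1 + e^{M^2/2} \sum_{k=k_n+1}^\infty \frac{1}{k!} M^{2k}\right)^n - 1$. 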

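Now note that $k! > (k/e)^k$. Hence 

(20)  $I_n^2 \le \left(1 + e^{3M^2/2}\left(\frac{eM^2}{k_n}\right)^{k_n}\right)^n - 1$. 

Now let $k_n$ be the smallest positive integer satisfying $k_n \ge \frac{\log n}{\log\log n} + \frac{\log n}{(\log\log n)^{3/2}}$. 
It is easy to check that $I_n \to 0$. Noting that $v_0 \le M/\sqrt{n}$ and applying Corollary 1 
yields 

$\inf_{\widehat{T}} \sup_{\theta \in \Theta_n(M)} E(\widehat{T} - T(\theta))^2 \ge \frac{(2M\delta_{k_n} - v_0 I_n)^2}{(I_n + 2)^2} = \beta_*^2 M^2 \left(\frac{\log\log n}{\log n}\right)^2 (1 + o(1))$ 

and (15) follows. 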

For the proof of (16), let $M = \sqrt{\log n}$ and take $k_n$ to be the smallest 
positive integer satisfying $k_n \ge (1.5)e\log n$. We may bound $I_n$ starting from 
(19) and then noting that for some constant $D > 0$ 

$I_n^2 \le \left(1 + e^{M^2/2} D \frac{M^{2k_n}}{k_n!}\right)^n - 1 \le \left(1 + n^{1/2} D \left(\frac{2}{3}\right)^{(3/2)e\log n}\right)^n - 1 \to 0$. 

It is then easy to check that Corollary 1 now yields (16). □ 

Remark 1. In the bounded case, we shall show in Section 4 that the 
minimax lower bound $\beta_*^2 M^2 (\log\log n / \log n)^2$ is asymptotically sharp. It can be 
seen from the proof of Theorem 2 that this minimax risk corresponds to the 
Bayes risk of the least favorable prior which is asymptotically equal to the 
prior $\frac{1}{2}(\mu_0^n + \mu_1^n)$. 

Remark 2. The proof of (16) can be used to show that for any constant 
$c > 0$, there exists another constant $d > 0$ such that 

(22)  $\inf_{\widehat{T}} \sup_{\theta \in \Theta_n(\sqrt{c\log n})} E(\widehat{T} - T(\theta))^2 \ge \frac{d}{\log n}(1 + o(1))$. 

4. Optimal estimation of the $\ell_1$ norm of bounded normal means. 
Section 3 developed minimax lower bounds for estimating the nonsmooth 
functional $T(\theta)$. Although the minimax lower bounds converge slowly, they are 
also difficult to attain. The difficulty of the estimation problem stems from 
the fact that the absolute value function is not differentiable at 0. In this 
section we shall consider the bounded case and construct an estimator that 
relies on the best polynomial approximation to the absolute value function 
and the use of Hermite polynomials. The estimator is then shown to be 
asymptotically sharp minimax. The unbounded case and the sparse case 
will be treated in the next two sections. 

4.1. Polynomial approximation. The construction of the rate optimal 
estimator is involved. This is partly due to the nonexistence of an unbiased 
estimator for $|\theta_i|$. Our strategy is to "smooth" the singularity at 0 by a 
polynomial approximation and construct an unbiased estimator for each term in 
the expansion by using the Hermite polynomials. 

The optimal estimator relies on the best polynomial approximation $G_K^*$ of 
the absolute value function. A drawback of using $G_K^*$ is that it is not 
convenient to construct. An explicit and nearly optimal polynomial approximation 
$G_K$ can be easily obtained by using the Chebyshev polynomials. Note that 
the Chebyshev polynomial (of the first kind) of degree $k$ is defined as 

$T_k(x) = \sum_{j=0}^{[k/2]} (-1)^j \frac{k}{k - j}\binom{k - j}{j} 2^{k-2j-1} x^{k-2j}$. 

The following expansion can be found, for example, in Rivlin (1990): 

(23)  $|x| = \frac{2}{\pi}T_0(x) + \frac{4}{\pi}\sum_{k=1}^\infty (-1)^{k+1}\frac{T_{2k}(x)}{4k^2 - 1}$, 

where $T_{2k}(x)$ is the Chebyshev polynomial of degree $2k$. Consider the 
truncated version of the expansion (23) and let 

(24)  $G_K(x) = \frac{2}{\pi}T_0(x) + \frac{4}{\pi}\sum_{k=1}^K (-1)^{k+1}\frac{T_{2k}(x)}{4k^2 - 1}$. 

We can also write $G_K(x)$ as 

(25)  $G_K(x) = \sum_{k=0}^K g_{2k} x^{2k}$. 

The following lemma provides uniform error bounds of $G_K^*$ and $G_K$ over 
the interval $[-1, 1]$ as well as bounds on the coefficients $g_{2k}^*$ and $g_{2k}$. These 
bounds are useful in the analysis of the optimal estimators. 

Lemma 2. Let $G_K^*(x) = \sum_{k=0}^K g_{2k}^* x^{2k}$ be the best polynomial 
approximation of degree $2K$ to $|x|$, and let $G_K$ be defined in (24). Then 

(26)  $\max_{x \in [-1,1]} |G_K^*(x) - |x|| \le \frac{\beta_*}{2K}(1 + o(1))$, 

(27)  $\max_{x \in [-1,1]} |G_K(x) - |x|| \le \frac{2}{\pi(2K + 1)}$. 

The coefficients $g_{2k}^*$ and $g_{2k}$ satisfy for all $0 \le k \le K$, 

(28)  $|g_{2k}^*| \le 2^{3K}$  and  $|g_{2k}| \le 2^{3K}$. 

The uniform error bounds (26) and (27) were proved in Bernstein (1913). 
The proof of the bound on the coefficients $g_{2k}^*$ and $g_{2k}$ is given in Section 8. 
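
The explicit approximation $G_K$ is easy to compute. The sketch below builds its Chebyshev-basis coefficients from (24), converts them to the monomial coefficients $g_{2k}$ of (25) with NumPy, and compares the uniform error on a grid with the bound $2/(\pi(2K + 1))$ in (27). The function names, the grid and the choice $K = 6$ are illustrative assumptions; the coefficient bound checked is $2^{3K}$ as in (28).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def abs_cheb_coeffs(K):
    """Chebyshev-basis coefficients of the truncated expansion G_K in (24)."""
    c = np.zeros(2 * K + 1)
    c[0] = 2.0 / np.pi
    for k in range(1, K + 1):
        c[2 * k] = (4.0 / np.pi) * (-1) ** (k + 1) / (4 * k**2 - 1)
    return c

def abs_monomial_coeffs(K):
    """Monomial coefficients g_0, g_2, ..., g_{2K} of G_K as in (25)."""
    return C.cheb2poly(abs_cheb_coeffs(K))[::2]

K = 6
x = np.linspace(-1.0, 1.0, 20001)
err = float(np.max(np.abs(C.chebval(x, abs_cheb_coeffs(K)) - np.abs(x))))
bound = 2.0 / (np.pi * (2 * K + 1))
g = abs_monomial_coeffs(K)

print("max error on grid:", err, " bound (27):", bound)
print("largest |g_2k|   :", float(np.max(np.abs(g))), " 2^(3K):", 2.0 ** (3 * K))
```

The error at $x = 0$ equals the bound exactly, so (27) cannot be improved for this particular polynomial.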

4.2. The construction of the optimal estimator. We shall now use the 
best polynomial approximation $G_K^*(x)$ and the Hermite polynomials to 
construct an estimator of $T(\theta)$ that is asymptotically sharp minimax over 
the bounded parameter space $\Theta_n(M)$. We first consider the special case of 
$M = 1$. The case of a general $M$ involves an additional rescaling step. 

When $M = 1$, it follows from Lemma 2 that each $|\theta_i|$ can be well 
approximated by $G_K^*(\theta_i) = \sum_{k=0}^K g_{2k}^* \theta_i^{2k}$ on the interval $[-1, 1]$ and hence the 
functional $T(\theta) = \frac{1}{n}\sum_{i=1}^n |\theta_i|$ can be approximated by 

$\widetilde{T}(\theta) = \frac{1}{n}\sum_{i=1}^n G_K^*(\theta_i) = \sum_{k=0}^K g_{2k}^* b_{2k}(\theta)$, 

where $b_{2k}(\theta) = \frac{1}{n}\sum_{i=1}^n \theta_i^{2k}$. Note that $\widetilde{T}(\theta)$ is a smooth functional, and we 
shall estimate $b_{2k}(\theta)$ separately for each $k$ by using the Hermite polynomials. 

Let $\phi$ be the density function of a standard normal variable. Recall that 
for positive integers $k$, 

(29)  $\frac{d^k}{dy^k}\phi(y) = (-1)^k H_k(y)\phi(y)$, 

where $H_k$ is a Hermite polynomial with respect to $\phi$. It is well known that for 
$X \sim N(\mu, 1)$, $H_k(X)$ is an unbiased estimate of $\mu^k$ for any positive integer $k$, 
that is, $E H_k(X) = \mu^k$. 

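Since $H_k(y_i)$ is an unbiased estimate of $\theta_i^k$ for each $i$, we can estimate 
$b_k(\theta) = \frac{1}{n}\sum_{i=1}^n \theta_i^k$ by $\bar{B}_k = \frac{1}{n}\sum_{i=1}^n H_k(y_i)$ and define the estimator of $T(\theta)$ by 

(30)  $\widehat{T}_K(\theta) = \sum_{k=0}^K g_{2k}^* \bar{B}_{2k}$. 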

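For estimating the functional $T(\theta)$ over the bounded parameter space 
$\Theta_n(M)$ for a general $M > 0$, we shall first rescale each $\theta_i$ and then 
approximate $|\theta_i|$ term by term. More specifically, let $\theta_i' = M^{-1}\theta_i$. Then $|\theta_i'| \le 1$ for 
$i = 1, \ldots, n$ and 

$\left| |\theta_i'| - G_K^*(\theta_i') \right| \le \frac{\beta_*}{2K}(1 + o(1))$  for all $|\theta_i'| \le 1$. 

Hence, 

$\left| |\theta_i| - \widetilde{G}_K^*(\theta_i) \right| \le \frac{\beta_* M}{2K}(1 + o(1))$  for all $|\theta_i| \le M$, 

where $\widetilde{G}_K^*(x) = \sum_{k=0}^K \widetilde{g}_{2k}^* x^{2k}$ with $\widetilde{g}_{2k}^* = g_{2k}^* M^{-2k+1}$. 
Again, $H_k(y_i)$ is an unbiased estimate of $\theta_i^k$. We estimate $b_{2k}(\theta) = \frac{1}{n}\sum_{i=1}^n \theta_i^{2k}$ by 

(31)  $\bar{B}_{2k} = \frac{1}{n}\sum_{i=1}^n H_{2k}(y_i)$ 

and define the estimator of $T(\theta)$ by 

(32)  $\widehat{T}_K(\theta; M) = \sum_{k=0}^K \widetilde{g}_{2k}^* \bar{B}_{2k} = \sum_{k=0}^K g_{2k}^* M^{-2k+1} \bar{B}_{2k}$. 

The performance of the estimator $\widehat{T}_K(\theta; M)$ clearly depends on the choice 
of the cutoff $K$. We shall specifically choose 

(33)  $K_* = \frac{\log n}{2\log\log n}$ 

and define our final estimator of $T(\theta)$ by 

(34)  $\widehat{T}_*(\theta) = \widehat{T}_{K_*}(\theta; M) = \sum_{k=0}^{K_*} \widetilde{g}_{2k}^* \bar{B}_{2k}$. 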
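
The construction can be sketched in a few lines of code. Computing the best-approximation coefficients $g_{2k}^*$ requires a Remez-type algorithm, so this sketch substitutes the explicit Chebyshev coefficients $g_{2k}$ of (24) and (25), which is the variant discussed in Remark 3 and is not the sharp estimator (34) itself. The function names, the simulated data, the seed and the rounding of the cutoff to an integer are all illustrative assumptions.

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import hermite_e as He

def abs_monomial_coeffs(K):
    """Monomial coefficients g_0, g_2, ..., g_{2K} of G_K in (24)-(25)."""
    c = np.zeros(2 * K + 1)
    c[0] = 2.0 / np.pi
    for k in range(1, K + 1):
        c[2 * k] = (4.0 / np.pi) * (-1) ** (k + 1) / (4 * k**2 - 1)
    return C.cheb2poly(c)[::2]

def l1_norm_estimate(y, M, K):
    """Plug-in form of (32), with g_{2k} used in place of g*_{2k}."""
    y = np.asarray(y, dtype=float)
    g = abs_monomial_coeffs(K)
    total = 0.0
    for k in range(K + 1):
        # B_{2k}: average of the Hermite polynomial H_{2k} over the sample.
        B2k = float(np.mean(He.hermeval(y, [0.0] * (2 * k) + [1.0])))
        total += g[k] * M ** (1 - 2 * k) * B2k
    return total

rng = np.random.default_rng(0)
n, M = 10**6, 1.0
theta = rng.uniform(-M, M, size=n)          # hypothetical bounded means
y = theta + rng.standard_normal(n)          # y_i ~ N(theta_i, 1)

K = max(1, int(math.log(n) / (2.0 * math.log(math.log(n)))))  # cutoff as in (33)
estimate = l1_norm_estimate(y, M, K)
truth = float(np.mean(np.abs(theta)))
print("K =", K, " estimate =", estimate, " truth =", truth)
```

For realistic $n$ the cutoff is a very small integer, which reflects the logarithmic rates above: the bias bound $2M/(\pi(2K + 1))$ shrinks only as fast as $K$ is allowed to grow.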

4.3. Optimality of the estimator. We now study the property of the 
estimator defined in (34). The following result shows that the estimator $\widehat{T}_*(\theta)$ is 
asymptotically sharp minimax, that is, it achieves the exact minimax lower 
bound given in Theorem 3 asymptotically. 

Theorem 4. Let $y_i \sim N(\theta_i, 1)$ be independent normal random variables 
with $|\theta_i| \le M$, $i = 1, \ldots, n$. Let $T(\theta) = n^{-1}\sum_{i=1}^n |\theta_i|$. The estimator $\widehat{T}_*(\theta)$ 
given in (34) satisfies 

(35)  $\sup_{\theta \in \Theta_n(M)} E(\widehat{T}_*(\theta) - T(\theta))^2 \le \beta_*^2 M^2 \left(\frac{\log\log n}{\log n}\right)^2 (1 + o(1))$. 

Remark 3. If $G_K(x)$, instead of $G_K^*(x)$, is used in the construction of 
the estimator $\widehat{T}_*(\theta)$, the resulting estimator $\widehat{\widetilde{T}}(\theta)$ satisfies 

(36)  $\sup_{\theta \in \Theta_n(M)} E(\widehat{\widetilde{T}}(\theta) - T(\theta))^2 \le 4\pi^{-2} M^2 \left(\frac{\log\log n}{\log n}\right)^2 (1 + o(1))$. 

The ratio of this upper bound to the minimax risk is $4\pi^{-2}/\beta_*^2 \approx 5.16$. 

We need the following variance bounds for the proof of Theorem 4 as well 
as other results given in the later sections. 

Lemma 3. Let $X \sim N(\mu, 1)$, then 

$E H_k^2(X) = k! \sum_{j=0}^k \binom{k}{j} \frac{\mu^{2j}}{j!}$. 

Consequently 

$\mathrm{Var}(H_k(X)) \le E(H_k^2(X)) \le e^{\mu^2} k^k$. 

If $|\mu| \le M$ and $M^2 \ge k$, then 

$\mathrm{Var}(H_k(X)) \le E(H_k^2(X)) \le (2M^2)^k$. 

The proof of Lemma 3 is given in Section 8. 
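
The two variance bounds can be illustrated numerically. The sketch below computes $E H_k^2(X)$ for $X \sim N(\mu, 1)$ by Gauss-Hermite quadrature and compares it with the standard closed form $k!\sum_{j=0}^k \binom{k}{j}\mu^{2j}/j!$, then checks it against $e^{\mu^2}k^k$ and $(2M^2)^k$. The helper names and the values of $k$, $\mu$ and $M$ are illustrative assumptions.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

# Gauss-Hermite rule normalised to integrate against the N(0, 1) density.
nodes, weights = He.hermegauss(40)
weights = weights / math.sqrt(2.0 * math.pi)

def second_moment_quadrature(k, mu):
    """E H_k(X)^2 for X ~ N(mu, 1), computed by quadrature."""
    values = He.hermeval(nodes + mu, [0.0] * k + [1.0])
    return float(np.sum(weights * values**2))

def second_moment_closed_form(k, mu):
    """k! * sum_j C(k, j) * mu^(2j) / j!."""
    return math.factorial(k) * sum(
        math.comb(k, j) * mu ** (2 * j) / math.factorial(j) for j in range(k + 1)
    )

k, mu, M = 5, 1.2, 3.0          # here |mu| <= M and M^2 >= k
exact = second_moment_closed_form(k, mu)
print("quadrature   :", second_moment_quadrature(k, mu))
print("closed form  :", exact)
print("e^{mu^2} k^k :", math.exp(mu**2) * k**k)
print("(2 M^2)^k    :", (2.0 * M**2) ** k)
```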

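Proof of Theorem 4. In the proof we shall assume $M \ge 1$. The case 
of $M < 1$ is similar. Note that $E\bar{B}_{2k} = b_{2k}(\theta)$ for $k \ge 0$ and hence, 

$E\widehat{T}_K(\theta; M) = \sum_{k=0}^K \widetilde{g}_{2k}^* \frac{1}{n}\sum_{i=1}^n \theta_i^{2k} = \frac{1}{n}\sum_{i=1}^n \widetilde{G}_K^*(\theta_i)$. 

The bias of $\widehat{T}_K(\theta; M)$ can then be bounded easily as follows. For any $\theta \in \Theta_n(M)$, 

$|E\widehat{T}_K(\theta; M) - T(\theta)| = \left| \frac{1}{n}\sum_{i=1}^n \widetilde{G}_K^*(\theta_i) - \frac{1}{n}\sum_{i=1}^n |\theta_i| \right| \le \frac{1}{n}\sum_{i=1}^n \left| \widetilde{G}_K^*(\theta_i) - |\theta_i| \right| \le \frac{\beta_* M}{2K}(1 + o(1))$. 

Now we consider the variance of $\widehat{T}_K(\theta; M)$. It follows from Lemma 3 that 
the variance of $\bar{B}_{2k}$ satisfies 

$\mathrm{Var}(\bar{B}_{2k}) = n^{-2}\sum_{i=1}^n \mathrm{Var}(H_{2k}(y_i)) \le e^{M^2}(2k)^{2k} n^{-1}$. 

To bound the variance of $\widehat{T}_K(\theta; M)$, first note that for any random variables 
$X_i$, $i = 1, \ldots, n$, 

(37)  $E\left(\sum_{i=1}^n X_i\right)^2 \le \left(\sum_{i=1}^n (E X_i^2)^{1/2}\right)^2$. 

It then follows that for all $\theta \in \Theta_n(M)$, 

$\mathrm{Var}(\widehat{T}_K(\theta; M)) \le \left(\sum_{k=1}^K |\widetilde{g}_{2k}^*|\,\mathrm{Var}^{1/2}(\bar{B}_{2k})\right)^2 \le 2e^{M^2} 2^{8K} K^{2K} n^{-1}$. 

Hence, the mean squared error of $\widehat{T}_K(\theta; M)$ is bounded by 

(38)  $E(\widehat{T}_K(\theta; M) - T(\theta))^2 \le \frac{\beta_*^2 M^2}{(2K)^2}(1 + o(1)) + 2e^{M^2} 2^{8K} K^{2K} n^{-1}$. 

Now set 

$K = K_* = \frac{\log n}{2\log\log n}$. 

Then the second term in (38) is negligible relative to the first term and we 
have, for all $\theta \in \Theta_n(M)$, 

$E(\widehat{T}_*(\theta) - T(\theta))^2 \le \beta_*^2 M^2 \left(\frac{\log\log n}{\log n}\right)^2 (1 + o(1))$. □ 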

5. Estimating the $\ell_1$ norm of unbounded normal means. We now turn 
to the unbounded case where no restriction is imposed on the values of 
the means $\theta_i$. This case is more difficult than the bounded case. We shall 
construct an estimator of $T(\theta)$ that attains the optimal rate of convergence, 
but not the optimal constant, for the unbounded case. In the construction 
below, both $G_K^*$ and $G_K$ work. For concreteness, hereafter we shall focus on 
using $G_K$ instead of the best polynomial approximation $G_K^*$. 

It turns out that a key step toward solving this general problem is to 
understand the estimation problem where the means are bounded with the 
bound growing with the sample size n. We shall thus first treat this case 
and then consider rate-optimal estimation for the general case. 

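5.1. Estimating the $\ell_1$ norm with a growing bound. Suppose $y_i \stackrel{\mathrm{ind}}{\sim} N(\theta_i, 1)$, 
$i = 1, 2, \ldots, n$, where $|\theta_i| \le M_n$ for $i = 1, \ldots, n$, with $M_n = \sqrt{c\log n}$ for some 
$c > 1$. As in the last section, we estimate $T(\theta)$ by first rescaling and define 
the estimator of $T(\theta)$ by 

(39)  $\widehat{T}_K(\theta; M_n) = \sum_{k=0}^K \widetilde{g}_{2k} \bar{B}_{2k}$, 

where $\widetilde{g}_{2k} = g_{2k} M_n^{-2k+1}$ and $\bar{B}_{2k} = \frac{1}{n}\sum_{i=1}^n H_{2k}(y_i)$. 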

Theorem 5. Let $y_i \sim N(\theta_i, 1)$ be independent normal random variables 
with $|\theta_i| \le M_n$, $i = 1, \ldots, n$, where $M_n = \sqrt{c\log n}$ for some $c > 1$. Let $T(\theta) = n^{-1}\sum_{i=1}^n |\theta_i|$. 
The estimator $\widehat{T}_K(\theta; M_n)$ given in (39) with K = ^\o^n — 
(logn)^/^ satisfies 

(40) sup E{Tid0^In)-T{e)f <^{\ognr\l + o{l)). 

eee„(A/„) 

This upper bound together with the minimax lower bound (22) shows 
that the estimator $\widehat{T}_K(\theta; M_n)$ defined in (39), with $K$ as in Theorem 5, 
is minimax rate optimal in this case. We shall show that the difficulty of 
estimating $T(\theta)$ over $\mathbb{R}^n$ is essentially the same as estimating over $\Theta_n(M_n)$ 
with an appropriate choice of $M_n$ of order $\sqrt{\log n}$. However, the construction 
of the rate-optimal estimator of $T(\theta)$ over $\mathbb{R}^n$ is much more complicated. 
The proof of Theorem 5 is given in Section 8. 

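5.2. Rate optimal estimator for the unbounded case. We now turn to 
the unbounded case. It is helpful to provide some intuition and motivation 
before we formally describe the estimation procedure. Consider the one-dimensional 
case. Suppose we observe $X \sim N(\mu, 1)$ and wish to estimate $|\mu|$. 
Set $M_n = 8\sqrt{\log n}$. Let $\mu' = M_n^{-1}\mu$. Then $|\mu'| \le 1$ whenever $|\mu| \le M_n$, and 

$\big| |\mu'| - G_K(\mu') \big| \le \frac{2}{\pi(2K+1)}$ for all $|\mu'| \le 1$. 

Hence, 

$\big| |\mu| - \tilde G_K(\mu) \big| \le \frac{2M_n}{\pi(2K+1)}$ for all $|\mu| \le M_n$, 

where $\tilde G_K(\mu) = \sum_{k=0}^{K} \tilde g_{2k} \mu^{2k}$ with $\tilde g_{2k} = g_{2k} M_n^{-2k+1}$. Again, $H_k(X)$ is an 
unbiased estimate of $\mu^k$. Set $K = \frac{1}{12} \log n$ and 

(41) $S_K(x) = \sum_{k=0}^{K} g_{2k} M_n^{-2k+1} H_{2k}(x).$ 

We define an estimator of $|\mu|$ by a truncated version of $S_K(X)$, 

(42) $\delta(X) = \min\{S_K(X), n^2\}.$ 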

It is easy to see that $\delta(X)$ is a good estimate of $|\mu|$ when $|\mu|$ is small. 
On the other hand, when $|\mu|$ is large, $\delta(X)$ is no longer a good estimator 
of $|\mu|$ because the variance of $\delta(X)$ is very large. When $|\mu|$ is large, a good 
estimate of $|\mu|$ is simply $|X|$. Therefore, for the unbounded case, a good 
strategy is to estimate $|\mu|$ by $\delta(X)$ when $|X|$ is not too large and estimate 
$|\mu|$ by $|X|$ when $|X|$ is large. 
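The unbiasedness property behind $S_K$ can be checked numerically. The sketch below assumes, as an illustration only, that $G_K$ is the truncated Chebyshev expansion of $|x|$ and that $H_k$ is the probabilists' Hermite polynomial, so that $E S_K(X) = M_n G_K(\mu / M_n)$ for $X \sim N(\mu, 1)$. The expectation is computed by Gauss-Hermite quadrature, which is exact for polynomials of this degree. The helper names `g_coeffs`, `s_k`, `delta` and `expected_s_k` are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import hermite_e as He


def g_coeffs(K):
    """Monomial coefficients g_0, g_2, ..., g_{2K} of the assumed G_K."""
    c = np.zeros(2 * K + 1)
    c[0] = 2.0 / np.pi
    for k in range(1, K + 1):
        c[2 * k] = (4.0 / np.pi) * (-1) ** (k + 1) / (4 * k * k - 1)
    return C.cheb2poly(c)[::2]


def s_k(x, K, M):
    """S_K(x) = sum_k g_{2k} M^(-2k+1) H_{2k}(x), following the form of (41)."""
    herm = np.zeros(2 * K + 1)
    herm[::2] = g_coeffs(K) * M ** (1.0 - 2.0 * np.arange(K + 1))
    return He.hermeval(x, herm)


def delta(x, K, M, n):
    """Truncated estimator min{S_K(x), n^2}, following the form of (42)."""
    return np.minimum(s_k(x, K, M), float(n) ** 2)


def expected_s_k(mu, K, M, deg=40):
    """E S_K(X) for X ~ N(mu, 1), via Gauss-Hermite quadrature."""
    nodes, weights = He.hermegauss(deg)  # weight function exp(-x^2 / 2)
    return np.sum(weights * s_k(mu + nodes, K, M)) / np.sqrt(2.0 * np.pi)
```

For $|\mu| \le M_n$ the computed expectation stays within $2M_n/(\pi(2K+1))$ of $|\mu|$, in line with the bias bound described above.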

We now formally state the procedure for estimating $T(\theta)$ as follows. We 
shall first use the idea of sample splitting. Note that observing $y_i \sim N(\theta_i, 1)$ 
is equivalent to observing $y_{il} \stackrel{\mathrm{i.i.d.}}{\sim} N(\theta_i, 2)$, for $l = 1, 2$. [One can generate $y_{i1}$ 
and $y_{i2}$ from $y_i$. Let $z_i \sim N(0, 1)$ be independent of $y_i$ and set $y_{i1} = y_i + z_i$ 
and $y_{i2} = y_i - z_i$. Then $y_{il} \stackrel{\mathrm{i.i.d.}}{\sim} N(\theta_i, 2)$.] Write $x_{il} = \frac{1}{\sqrt{2}} y_{il}$ for $l = 1, 2$ and 
$i = 1, \ldots, n$. Then $x_{il} \stackrel{\mathrm{i.i.d.}}{\sim} N(\theta_i', 1)$, for $l = 1, 2$, with $\theta_i' = \theta_i / \sqrt{2}$. Estimating 
$T(\theta)$ based on $\{y_i\}$ is thus equivalent to estimating $\sqrt{2}\, T(\theta')$ based on $\{x_{il}\}$. 
We shall construct an estimate $\hat T(\theta')$ for $T(\theta')$ and estimate $T(\theta)$ by $\sqrt{2}\, \hat T(\theta')$. 
We define the estimate of $T(\theta') = n^{-1} \sum_{i=1}^{n} |\theta_i'|$ by 

(43) $\hat T(\theta') = \frac{1}{n} \sum_{i=1}^{n} \big\{ \delta(x_{i1}) I(|x_{i2}| \le 2\sqrt{2 \log n}) + |x_{i1}| I(|x_{i2}| > 2\sqrt{2 \log n}) \big\},$ 

where $\delta(\cdot)$ is defined in (41) and (42), and define the estimator of $T(\theta)$ by 

(44) $\hat T(\theta) = \sqrt{2}\, \hat T(\theta').$ 

Here $|x_{i2}|$ is used to test the size of $|\theta_i'|$, and based on the test we use either 
$\delta(x_{i1})$ or $|x_{i1}|$ to estimate $|\theta_i'|$. 
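The sample-splitting procedure can be sketched as follows. This is a minimal illustration rather than a tuned implementation: it assumes the truncated Chebyshev expansion of $|x|$ for $G_K$, the probabilists' Hermite polynomials for $H_k$, and it rounds $K = \frac{1}{12} \log n$ to an integer of at least 1, which is a choice made here and not taken from the text. The names `s_k` and `hybrid_estimate` are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import hermite_e as He


def s_k(x, K, M):
    """S_K(x) with the assumed Chebyshev-based coefficients g_{2k}."""
    c = np.zeros(2 * K + 1)
    c[0] = 2.0 / np.pi
    for k in range(1, K + 1):
        c[2 * k] = (4.0 / np.pi) * (-1) ** (k + 1) / (4 * k * k - 1)
    g = C.cheb2poly(c)[::2]
    herm = np.zeros(2 * K + 1)
    herm[::2] = g * M ** (1.0 - 2.0 * np.arange(K + 1))
    return He.hermeval(x, herm)


def hybrid_estimate(y, rng):
    """Sketch of (43)-(44): split each y_i, test with one half, estimate with the other."""
    n = y.size
    log_n = np.log(n)
    K = max(1, int(log_n / 12.0))      # assumption: integer rounding, at least 1
    M = 8.0 * np.sqrt(log_n)
    z = rng.standard_normal(n)         # auxiliary noise used for the split
    x1 = (y + z) / np.sqrt(2.0)        # used for estimation
    x2 = (y - z) / np.sqrt(2.0)        # used for testing
    thresh = 2.0 * np.sqrt(2.0 * log_n)
    delta = np.minimum(s_k(x1, K, M), float(n) ** 2)
    per_coord = np.where(np.abs(x2) <= thresh, delta, np.abs(x1))
    return np.sqrt(2.0) * per_coord.mean()
```

Since $y_i + z_i$ and $y_i - z_i$ are jointly normal with zero covariance, the test based on $x_{i2}$ is independent of the estimate based on $x_{i1}$. For moderate $n$ the value $\frac{1}{12} \log n$ is below 1, so the sketch only shows the structure of the procedure and not its asymptotic behavior.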

The following theorem shows that $\hat T(\theta)$ attains the rate of convergence 
$(\log n)^{-1}$ over the whole parameter space $\mathbb{R}^n$. 

Theorem 6. The estimator $\hat T(\theta)$ defined in (43) and (44) satisfies, for 
all $\theta \in \mathbb{R}^n$, 

(45) $E(\hat T(\theta) - T(\theta))^2 \le \frac{C}{\log n}(1 + o(1))$ 

for some constant $C > 0$. 

Together with the minimax lower bound given in Theorem 3, Theorem 6 
shows that the hybrid estimator is rate optimal over the parameter space 
$\mathbb{R}^n$. The proof of Theorem 6 is involved and is given in Section 8. The key 
is to analyze the bias and variance of a single component. 

6. Estimating the $\ell_1$ norm of sparse normal means. In high-dimensional 
problems, an especially interesting case is when the mean vector is sparse, 
that is, only a small proportion of the $\theta_i$'s are nonzero. Suppose we observe 
$y_i \sim N(\theta_i, 1)$, $i = 1, 2, \ldots, n$, where the mean vector $\theta$ is sparse: only 
a small fraction of components are nonzero, and the locations of the nonzero 
components are unknown. 

Denote the $\ell_0$ quasi-norm by $\|\theta\|_0 = \mathrm{Card}(\{i : \theta_i \ne 0\})$. Fix $k_n$; the collection 
of vectors with exactly $k_n$ nonzero entries is 

$\Theta_{k_n} = \ell_0(k_n) = \{\theta \in \mathbb{R}^n : \|\theta\|_0 = k_n\}.$ 

In this section we consider the problem of estimating the average of the 
absolute value of the nonzero means. For $\theta \in \Theta_{k_n}$, 

(46) $T(\theta) = \mathrm{average}\{|\theta_i| : \theta_i \ne 0\} = \frac{1}{k_n} \sum_{i=1}^{n} |\theta_i|.$ 

We calibrate the sparsity parameter $k_n$ by $k_n = n^\beta$ for $0 < \beta < 1$. The 
following result shows that for $0 < \beta \le \frac{1}{2}$, it is not possible to estimate the 
functional $T(\theta)$ consistently. 

Theorem 7. Let $k_n = n^\beta$. Then for all $0 < \beta \le \frac{1}{2}$, the minimax risk 
satisfies 

(47) $\inf_{\hat T} \sup_{\theta \in \Theta_{k_n}} E(\hat T(\theta) - T(\theta))^2 \ge C$ 

for some constant $C > 0$. 

The proof of Theorem 7 is analogous to that of Theorem 7 in Cai and 
Low (2004), and we omit it here for reasons of space. 

We now turn to the more interesting case where $k_n = n^\beta$ with $\frac{1}{2} < \beta < 1$. 
The following result shows that the minimax rate of convergence in this case 
is $(\log n)^{-1}$. 

Theorem 8. Let $k_n = n^\beta$ for some $\frac{1}{2} < \beta < 1$. Then the minimax risk 
for estimating the functional $T(\theta)$ over $\Theta_{k_n}$ satisfies 

(48) $\inf_{\hat T} \sup_{\theta \in \Theta_{k_n}} E(\hat T(\theta) - T(\theta))^2 \asymp \frac{1}{\log n}.$ 

The proof of the lower bound in Theorem 8 is similar to that of Theorem 3. 
The upper bound can be attained by a modified version of the estimator $\hat T(\theta)$ 
defined in (43) and (44). The key in the construction is to have estimates 
of the individual coordinates that perform well when the coordinates are 
zero. This can be achieved by using the polynomial approximation $G_K(x)$ 
[or $G_K^*(x)$] without the constant term. 

As in Section 5.2 set $K = \frac{1}{12} \log n$, and define 

(49) $\tilde S_K(x) = \sum_{k=1}^{K} g_{2k} M_n^{-2k+1} H_{2k}(x).$ 

Note that here the constant term $g_0$ is excluded. We then define an estimator 
of $|\mu|$ by truncating $\tilde S_K(X)$, 

(50) $\tilde\delta(X) = \min\{\tilde S_K(X), n^2\}.$ 

Note that the bias of the estimator $\tilde\delta(X)$ is much smaller than the bias of 
$\delta(X)$ when the mean of $X$ is zero. As in Section 5.2 we split the sample 
into two parts and use one for testing and the other for estimation. Let 
$x_{il}$ be defined as in Section 5.2. That is, $x_{il} \sim N(\theta_i', 1)$, for $l = 1, 2$, with 
$\theta_i' = \theta_i / \sqrt{2}$. We define the estimate of $T(\theta') = k_n^{-1} \sum_{i=1}^{n} |\theta_i'|$ by 

(51) $\tilde T(\theta') = \frac{1}{k_n} \sum_{i=1}^{n} \big\{ \tilde\delta(x_{i1}) I(|x_{i2}| \le 2\sqrt{2 \log n}) + |x_{i1}| I(|x_{i2}| > 2\sqrt{2 \log n}) \big\},$ 

where $\tilde\delta$ is defined in (49) and (50), and define the estimator of $T(\theta)$ by 

(52) $\tilde T(\theta) = \sqrt{2}\, \tilde T(\theta').$ 

It can be shown that the estimator $\tilde T(\theta)$ is rate optimal for estimating $T(\theta)$ 
over $\Theta_{k_n}$. The proof is similar to that of Theorem 6, and we omit the details 
here. 

7. Discussions. The present paper was partly inspired by the general 
theory of estimating functionals based on i.i.d. observations given in Donoho 
and Liu (1991), which showed that bounds on minimax estimation can be 
based on testing two composite hypotheses. The difficulty of the composite 
testing problem was shown in Le Cam (1973, 1986) to depend on the total 
variation distance between the convex hulls of the two composite hypotheses. 
In the present context the priors $\mu_0$ and $\mu_1$ used in the general lower bound 
of Section 2 can be viewed as picking two points in the convex hulls of two 
subsets of the parameter space, and Theorem 2 gives bounds on the risk over 
these two points. Sections 3 and 4 show that a careful choice of these priors 
yields sharp minimax lower bounds for estimating the $\ell_1$ norm of the means 
of normal random variables. 

Best polynomial approximation played a major role in the development 
of our results, both for the upper and lower bounds. Note that the last two 
conditions in Lemma 1 yield 

$\int (|t| - G_k^*(t))\, \nu_1(dt) - \int (|t| - G_k^*(t))\, \nu_0(dt) = 2\delta_k.$ 

From the definition of $G_k^*$ we have $-\delta_k \le |t| - G_k^*(t) \le \delta_k$ for all $-1 \le t \le 1$. 
Since $\nu_0$ and $\nu_1$ are probability measures it follows that they are supported 
on the subsets $A_0$ and $A_1$ of the alternation points defined in (13) and (14), 
respectively. 

We should also emphasize that the values of the functional $T$ on the two 
sets of support points are not well separated. In fact the values alternate. 
This is quite different from the more standard cases of estimating a linear 
or quadratic functional. In the case of a quadratic functional, even though 
the alternative hypothesis may need to be composite, the functional only 
takes on two values, one on the null and the other on the alternative. See, 
for example, Cai and Low (2005). 

The techniques given here can also be compared to those found in Lepski, 
Nemirovski and Spokoiny (1999) where attention was focused on estimating 
the $L_1$ norm of a regression function. In that paper lower bounds were 
constructed by mixing in a way similar to that used in the present paper. 
However, instead of bounding a chi-square distance, a bound was given 
for the Kullback-Leibler distance. It is, however, not easy to provide good 
bounds directly for the Kullback-Leibler distance. This is particularly true 
in cases which correspond to parameter spaces with growing bounds. The 
lower bounds provided there only work in the case where the parameter 
space has a fixed bound. 

For upper bounds, Lepski, Nemirovski and Spokoiny (1999) used a Fourier 
series approximation of $|x|$, and the estimate is based on unbiased estimates 
of individual terms in the approximation. The maximum error of 
the best $K$-term Fourier series approximation can be shown easily to be of 
order $K^{-1}$, which is comparable to the best polynomial approximation of 
degree $K$. However, the variance bound of the estimator based on the $K$-term 
Fourier series approximation is of order $e^{CK^2}$ for some constant $C > 0$, 
whereas the variance of our estimator based on the polynomial approximation 
of degree $K$ grows at the rate of $K^K = e^{K \log K}$. So the variance of the 
polynomial-based estimator is much smaller than that of the corresponding 
estimator using Fourier series, even though the biases of the two estimators 
are very similar. This allows for more terms to be used in the polynomial 
approximation with the same variance level, thus reducing the bias of the 
estimate. In the bounded case, the best rate of convergence for estimators 
using Fourier series approximation can be shown to be $(\log n)^{-1}$, which is 
sub-optimal relative to the minimax rate $(\frac{\log \log n}{\log n})^2$. Another drawback of 
the Fourier series method is that it cannot be used for the unbounded case. 

The techniques and results developed in the present paper can be used to 
solve other related problems. For example, when the approach taken in this 
paper is used for estimating the $L_1$ norm of a regression function, both the 
upper and lower bounds given in Lepski, Nemirovski and Spokoiny (1999) 
are improved. For reasons of space, we shall report the results elsewhere. 
The techniques can also be used for estimating other nonsmooth functionals 
such as excess mass. See Cai and Low (2010). 

8. Proofs. In this section we first prove the technical lemmas given in 
the earlier sections. We then prove Theorem 5 in Section 8.2. The proof of 
Theorem 6 is involved and will be given in Section 8.3. 

8.1. Proof of technical lemmas. 

Proof of Lemma 1. The proof of this lemma relies on the Hahn-Banach 
theorem and the Riesz representation theorem. The argument is 
essentially the same as the one given in Lepski, Nemirovski and Spokoiny 
(1999). We include it here for completeness. 

Consider the space $C(-1, 1)$ of continuous real-valued functions on the 
interval $[-1, 1]$ with uniform norm $\|\cdot\|_\infty$. Clearly $f(t) = |t|$ defined on this 
interval $[-1, 1]$ belongs to $C(-1, 1)$. Let $\delta_k$ be the distance in uniform norm 
on $[-1, 1]$ from the function $f$ to the space of polynomials of order $k$. Let $\mathcal{P}_k$ 
be the linear space spanned by the collection of polynomials of order $k$ and 
in addition let $\mathcal{F}_k$ be the linear space spanned by $\mathcal{P}_k$ and $f$. Note that every 
element $g \in \mathcal{F}_k$ can be written uniquely as $g = cf + p_k$ where $p_k \in \mathcal{P}_k$ and 
$c \in \mathbb{R}$. Let $T$ be the linear functional defined by $T(g) = T(cf + p_k) = c\delta_k$. It is 
then clear that $T = 0$ on $\mathcal{P}_k$ and $T(f) = \delta_k$. Now the norm of the functional 
$T$ is given by 

$\|T\| = \sup\{T(g) : g \in \mathcal{F}_k, \|g\|_\infty \le 1\}.$ 

It can be checked directly that the norm of this functional is equal to 1. 
Let $G_k^*$ be the closest polynomial in $\mathcal{P}_k$ to $f$. Then $\|f - G_k^*\|_\infty = \delta_k$, and it 
follows that $\frac{1}{\delta_k}(f - G_k^*)$ has a norm of 1. Since $T(\frac{1}{\delta_k}(f - G_k^*)) = 1$ it follows 
that $\|T\| \ge 1$. Now suppose that $\|T\| > 1$. Then there exists an element 
$g = cf + p_k$ with $p_k \in \mathcal{P}_k$ such that $\|g\|_\infty = 1$ and $T(g) > 1$. This implies 
that $c > \frac{1}{\delta_k}$ and 

$\big\| f + \tfrac{1}{c} p_k \big\|_\infty = \tfrac{1}{c} < \delta_k.$ 

Since $-\frac{1}{c} p_k \in \mathcal{P}_k$, this is a contradiction to the definition of $\delta_k$ which is the 
distance between $f$ and $\mathcal{P}_k$. 

Now by the Hahn-Banach theorem the linear functional $T$ can be extended 
to $C(-1, 1)$ without increasing the norm of the functional. For simplicity 
we shall also call this linear functional $T$. It then follows from the 
Riesz representation theorem that for each $g \in C(-1, 1)$ 

$T(g) = \int_{-1}^{1} g(t)\, \tau(dt),$ 

where $\tau$ is a Borel signed measure with total variation equal to 1. 

It follows from the Hahn-Jordan decomposition that there exist two positive 
measures $\tau_+$ and $\tau_-$ such that $\tau = \tau_+ - \tau_-$. It then follows that 

(53) $\int_{-1}^{1} |t|\, \tau_+(dt) - \int_{-1}^{1} |t|\, \tau_-(dt) = \delta_k$ and $\int_{-1}^{1} t^l\, \tau_+(dt) = \int_{-1}^{1} t^l\, \tau_-(dt)$ for $l = 0, 1, \ldots, k$. 

Define the measures $\tau_-^*$ and $\tau_+^*$ by $\tau_-^*(S) = \tau_-(-S)$ and $\tau_+^*(S) = \tau_+(-S)$ 
for all measurable sets $S$. Then (53) holds with $\tau_-$ and $\tau_+$ replaced by $\tau_-^*$ 
and $\tau_+^*$, respectively. Hence (53) is also true with $\tau_-$ and $\tau_+$ replaced by 
$(\tau_- + \tau_-^*)/2$ and $(\tau_+ + \tau_+^*)/2$, respectively. We can thus assume that $\tau$ is 
symmetric. 

Now take $\nu = 2\tau$. Then $\nu$ is symmetric and 

(54) $\int |t|\, \nu(dt) = 2\delta_k$ and $\int t^l\, \nu(dt) = 0$ for $l = 0, 1, \ldots, k$. 

Now let $\nu_1$ and $\nu_0$ be the positive and the negative components of $\nu$. 
Then both $\nu_1$ and $\nu_0$ are symmetric. Since $\nu$ has variation equal to 2 and 
$\int_{-1}^{1} \nu(dt) = 0$ it follows that $\nu_1$ and $\nu_0$ are both probability measures. 

These measures also clearly satisfy by construction 

$\int t^l\, \nu_1(dt) = \int t^l\, \nu_0(dt)$ for $l = 0, 1, \ldots, k$ 

and also 

$\int |t|\, \nu_1(dt) - \int |t|\, \nu_0(dt) = 2\delta_k.$ □ 



Proof of Lemma 2. The Chebyshev polynomial can be alternatively 
written as 

(55) $T_{2m}(x) = \sum_{l=0}^{m} (-1)^{m-l} \Big[ \sum_{j=m-l}^{m} \binom{2m}{2j} \binom{j}{m-l} \Big] x^{2l}.$ 

Write $T_{2m}(x) = \sum_{l=0}^{m} t_{2l} x^{2l}$. Then 

(56) $|t_{2l}| = \sum_{j=m-l}^{m} \binom{2m}{2j} \binom{j}{m-l} \le \sum_{j=m-l}^{m} \binom{2m}{2j} \binom{m}{m-l} \le 2^{3m}.$ 

It is now easy to see that the coefficient for $x^{2k}$ in the polynomial $G_K(x)$ is 
bounded from above by $2^{3K}$. 

The bound on the coefficients of the best polynomial approximation 
follows from Theorem E in Qazi and Rahman (2007) and the bound (56). 
□ 
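The coefficient bounds in this lemma can be checked numerically for small degrees. The sketch below converts Chebyshev polynomials to the monomial basis with NumPy and compares the largest coefficient with $2^{3m}$ and $2^{3K}$. It assumes, for illustration, that $G_K$ is the truncated Chebyshev expansion of $|x|$. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def cheb_monomial_coeffs(m):
    """Monomial coefficients of the Chebyshev polynomial T_{2m}."""
    c = np.zeros(2 * m + 1)
    c[2 * m] = 1.0
    return C.cheb2poly(c)


def g_coeffs(K):
    """Monomial coefficients g_0, g_2, ..., g_{2K} of the assumed G_K."""
    c = np.zeros(2 * K + 1)
    c[0] = 2.0 / np.pi
    for k in range(1, K + 1):
        c[2 * k] = (4.0 / np.pi) * (-1) ** (k + 1) / (4 * k * k - 1)
    return C.cheb2poly(c)[::2]


def max_abs_coeff(coeffs):
    """Largest coefficient in absolute value."""
    return float(np.max(np.abs(coeffs)))
```

For example, $T_4(x) = 8x^4 - 8x^2 + 1$ has largest coefficient 8, well below $2^{6} = 64$, so the bound is far from tight but sufficient for the variance calculations that follow.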
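Proof of Lemma 3. Write $X = \mu + z$ with $z \sim N(0, 1)$. It is well known 
that $E(H_k^2(z)) = k!$, $E(H_i(z) H_j(z)) = 0$ for $i \ne j$, and 

$H_k(\mu + z) = \sum_{j=0}^{k} \binom{k}{j} \mu^j H_{k-j}(z).$ 

Hence, 

$E H_k^2(X) = E H_k^2(\mu + z) = \sum_{i=0}^{k} \sum_{j=0}^{k} \binom{k}{i} \binom{k}{j} \mu^{i+j} E(H_{k-i}(z) H_{k-j}(z)) = \sum_{j=0}^{k} \binom{k}{j}^2 \mu^{2j} (k-j)! = k! \sum_{j=0}^{k} \binom{k}{j} \frac{\mu^{2j}}{j!}.$ 

Note that $k!/j! \le k^{k-j}$ and hence, 

$E H_k^2(X) = k! \sum_{j=0}^{k} \binom{k}{j} \frac{\mu^{2j}}{j!} \le \sum_{j=0}^{k} \binom{k}{j} k^{k-j} \mu^{2j} = (k + \mu^2)^k.$ 

If $|\mu| \le M$ and $M^2 \ge k$, then for all $0 \le j \le k$, $k^{k-j} \mu^{2j} \le M^{2(k-j)} M^{2j} = M^{2k}$. Hence, 

$E H_k^2(X) = k! \sum_{j=0}^{k} \binom{k}{j} \frac{\mu^{2j}}{j!} \le \sum_{j=0}^{k} \binom{k}{j} M^{2k} = (2M^2)^k.$ □ 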
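The second-moment identity and the two bounds can be verified numerically. The sketch below assumes $H_k$ is the probabilists' Hermite polynomial and computes $E H_k^2(X)$ for $X \sim N(\mu, 1)$ by Gauss-Hermite quadrature, which is exact for polynomial integrands of this degree. The function names are hypothetical.

```python
import numpy as np
from math import comb, factorial
from numpy.polynomial import hermite_e as He


def second_moment_quadrature(k, mu, deg=30):
    """E H_k(X)^2 for X ~ N(mu, 1), by Gauss-Hermite quadrature."""
    nodes, weights = He.hermegauss(deg)  # weight function exp(-x^2 / 2)
    basis = np.zeros(k + 1)
    basis[k] = 1.0  # selects the Hermite polynomial of degree k
    vals = He.hermeval(mu + nodes, basis)
    return np.sum(weights * vals ** 2) / np.sqrt(2.0 * np.pi)


def second_moment_formula(k, mu):
    """Closed form k! * sum_j C(k, j) * mu^(2j) / j!."""
    return factorial(k) * sum(
        comb(k, j) * mu ** (2 * j) / factorial(j) for j in range(k + 1)
    )
```

Comparing the two functions over a small grid of $k$ and $\mu$ confirms the closed form, and the closed form can then be compared directly with $(k + \mu^2)^k$ and $(2M^2)^k$.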

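8.2. Proof of Theorem 5. For $\theta = (\theta_1, \ldots, \theta_n) \in \mathbb{R}^n$, denote 

$b_k(\theta) = \frac{1}{n} \sum_{i=1}^{n} \theta_i^k.$ 

Note that $E \bar B_k = b_k(\theta)$ for $k \ge 0$, and hence 

$E \hat T(\theta) = \sum_{k=0}^{K} \tilde g_{2k} b_{2k}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \tilde G_K(\theta_i).$ 

The bias of $\hat T(\theta)$ can then be bounded easily as follows: 

$|E \hat T(\theta) - T(\theta)| = \Big| \frac{1}{n} \sum_{i=1}^{n} \tilde G_K(\theta_i) - \frac{1}{n} \sum_{i=1}^{n} |\theta_i| \Big| \le \frac{1}{n} \sum_{i=1}^{n} \big| \tilde G_K(\theta_i) - |\theta_i| \big| \le \frac{2M_n}{\pi(2K+1)}.$ 

Now we consider the variance of $\hat T(\theta)$. Note that $M_n^2 > K$. In this case, 
the variance of $\bar B_{2k}$ can be bounded by 

$\mathrm{Var}(\bar B_{2k}) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(H_{2k}(y_i)) \le n^{-1} (2M_n^2)^{2k}.$ 

Hence 

$\mathrm{Var}(\hat T(\theta)) \le \Big\{ \sum_{k=0}^{K} |\tilde g_{2k}| \mathrm{Var}^{1/2}(\bar B_{2k}) \Big\}^2 \le \Big\{ \sum_{k=0}^{K} |g_{2k}| M_n^{-2k+1} 2^k M_n^{2k} \Big\}^2 \cdot n^{-1}.$ 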

With K = i log2 n - (log n) the mean squared error is then bounded by 
E{f(e) - T{9)f < + ^M^^ . n-' = l^(logn)-i(l + o(l)). 

8.3. Proof of Theorem 6. We now analyze the properties of the hybrid 
estimator defined in (44). The key is to study the bias and variance of a single 
component. Let $x_1, x_2 \stackrel{\mathrm{i.i.d.}}{\sim} N(\mu, 1)$ and let 

(57) $\xi = \xi(x_1, x_2) = \delta(x_1) I(|x_2| \le 2\sqrt{2 \log n}) + |x_1| I(|x_2| > 2\sqrt{2 \log n}).$ 

Note that 

$E(\xi) = E\delta(x_1) P(|x_2| \le 2\sqrt{2 \log n}) + E|x_1| P(|x_2| > 2\sqrt{2 \log n}).$ 

Lemma 4. Suppose $I(A)$ is an indicator random variable independent of 
$X$ and $Y$, then 

(58) $\mathrm{Var}(X I(A) + Y I(A^c)) = \mathrm{Var}(X) P(A) + \mathrm{Var}(Y) P(A^c) + (EX - EY)^2 P(A) P(A^c).$ 
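The identity in Lemma 4 can be confirmed exactly on small discrete distributions using rational arithmetic. The sketch below enumerates all outcomes of a mixture $X I(A) + Y I(A^c)$; for simplicity it takes $X$ and $Y$ independent of each other, which is an extra assumption made here and is not required by the lemma. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def mean_var(dist):
    """Mean and variance of a finite distribution given as [(value, prob), ...]."""
    m = sum(p * v for v, p in dist)
    return m, sum(p * (v - m) ** 2 for v, p in dist)


def mixture_var(x_dist, y_dist, p_a):
    """Exact Var(X I(A) + Y I(A^c)) with P(A) = p_a and A independent of (X, Y)."""
    outcomes = []
    for (x, px), (y, py) in product(x_dist, y_dist):
        outcomes.append((x, px * py * p_a))        # event A occurs: take X
        outcomes.append((y, px * py * (1 - p_a)))  # event A fails: take Y
    return mean_var(outcomes)[1]


def lemma4_rhs(x_dist, y_dist, p_a):
    """Right-hand side of (58)."""
    ex, vx = mean_var(x_dist)
    ey, vy = mean_var(y_dist)
    return vx * p_a + vy * (1 - p_a) + (ex - ey) ** 2 * p_a * (1 - p_a)
```

Because `Fraction` arithmetic is exact, the two sides can be compared with `==` rather than with a numerical tolerance.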

Applying Lemma 4, we have 

(59) $\mathrm{Var}(\xi) = \mathrm{Var}(\delta(x_1)) P(|x_2| \le 2\sqrt{2 \log n}) + \mathrm{Var}(|x_1|) P(|x_2| > 2\sqrt{2 \log n}) + (E\delta(x_1) - E|x_1|)^2 P(|x_2| \le 2\sqrt{2 \log n}) P(|x_2| > 2\sqrt{2 \log n}).$ 

We also need the following lemma for the variance of $\delta$. [The proof is 
similar to Lemma 2 in Cai and Low (2005).] 

Lemma 5. For any two random variables $X$ and $Y$, 

(60) $\mathrm{Var}(\min\{X, Y\}) \le \mathrm{Var}\, X + \mathrm{Var}\, Y.$ 

In particular, for any random variable $X$ and any constant $C$, 

(61) $\mathrm{Var}(\min(X, C)) \le \mathrm{Var}\, X.$ 

Proof. Without loss of generality we can assume $E(X) = 0$ and $E(Y) \le 0$. 
Let $Z = \min\{X, Y\}$. Then 

(62) $EZ^2 \le EX^2 + EY^2$ 

and 

(63) $EZ \le E(Y).$ 

Hence $(EZ)^2 \ge (EY)^2$ and consequently 

(64) $\mathrm{Var}\, Z = EZ^2 - (EZ)^2 \le EX^2 + EY^2 - (EY)^2 = \mathrm{Var}\, X + \mathrm{Var}\, Y.$ □ 
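A quick exact check of (60) and (61) on finite distributions is sketched below. It takes $X$ and $Y$ independent for the enumeration, which is an assumption made here for simplicity; the lemma itself does not need independence. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def variance(dist):
    """Variance of a finite distribution given as [(value, prob), ...]."""
    mean = sum(p * v for v, p in dist)
    return sum(p * (v - mean) ** 2 for v, p in dist)


def min_of_independent(x_dist, y_dist):
    """Distribution of min(X, Y) when X and Y are independent."""
    return [(min(x, y), px * py) for (x, px), (y, py) in product(x_dist, y_dist)]


def constant(c):
    """Degenerate distribution at the constant c."""
    return [(Fraction(c), Fraction(1))]
```

Truncating at a constant corresponds to `min_of_independent(X, constant(c))`, which is how (61) is used for the truncated estimator $\delta$.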

Lemma 6. Let $X \sim N(\mu, 1)$ and $S_K(x) = \sum_{k=0}^{K} g_{2k} M_n^{-2k+1} H_{2k}(x)$ with 
$M_n = 8\sqrt{\log n}$ and $K = \frac{1}{12} \log n$. Then for all $|\mu| \le 4\sqrt{2 \log n}$, 

(65) $|E S_K(X) - |\mu|| \le \frac{2M_n}{\pi(2K+1)},$ 



(66) ESl{X)<n^^^ log 



Proof. The first part follows from Lemmas 2 and 3 and the discus- 
sions in Section 5.1. To bound ES'j^{X), it follows from inequality (37) and 
Lemmas 2 and 3 that 

ESliX) < (^|;i<72.|M-2^-+i(ii;i7|,(X))V2^ 



< 2^^ ^(8yk^)^^'=+^(641ogn)'= 



\k=l 



<n^/^log^n. □ 



Write $B(\xi) = E(\xi) - |\mu|$ for the bias of $\xi$. We divide into three cases 
according to the value of $|\mu|$. In the first case when $|\mu| \le \sqrt{2 \log n}$, we shall 
show that the estimator behaves essentially like $\delta(x_1)$, which is a good 
estimator when $|\mu|$ is small. In the second case when $\sqrt{2 \log n} \le |\mu| \le 4\sqrt{2 \log n}$, 
we show that the hybrid estimator uses either $\delta(x_1)$ or $|x_1|$ and in this case 
both are good estimators of $|\mu|$. In the third case when $|\mu|$ is large, the 
hybrid estimator is essentially the same as $|x_1|$. 
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Case 1. < V2 logn. Note that 6{xi) can be written as 6{xi) = Sxixi) 
{Sk{xi) — n)7(5i^(xi) > n) and consequently 



(67) 



= \{E5{xi)P{\x2\ < 2 ^2 log n) + ^|xi|)P(|x2| > 2^/2 log n) - 
< |S5x(xi) - |/i|| + - n)/(5i^(xi) > n)} 



+ i\ESK{xi)\+E\xi\)P{\x2\ > 2x/21ogn) 
= i?l + -B2 + -B3. 
Lemma 6 yields that 

2M„ 



5l = |S5A'(xi)-|/i||< 



7r(2i^ + l)' 



It follows from the fact < ^1 logn and the standard bound for normal 
tail probability z) < z~^(f){z) for z > that 

(68) P(|x2| >2v^21ogra)<2^(-^21ogra)< ^ ^ 

V vr log n 

Note that in this case 

(69) \ESK{xi)\ = \GK{fi)\<\fi\+ 



tt{2K + 1) 



(70) E\xi\ = + 2(j){fi) - 2\fi\<^>{-\fi\) < + 1 < V2Iogn + l. 
It then follows from (68)-(70) that 

/ r-. 2M„ \ 1 1 1 

Now consider $B_2$. Note that for any random variable $X$ and any constant 
$\lambda > 0$, 

(71) $E(X I(X > \lambda)) \le \lambda^{-1} E(X^2 I(X > \lambda)) \le \lambda^{-1} E X^2.$ 
This together with Lemma 6 yields that 

(72) B2 < E{Sk{xi)I{Sk{xi) > n)} < n-^ES]^{xi) < n'^^^ log^ n. 

Combining the three terms together shows that in this case the bias is 
bounded by 

1^(01 <Bi + B2 + Bs<^il + 0(1)). 
We now consider the variance. It follows from (59) and Lemma 5 that 



Var(^) < Var(5i^(2;i)) + Var(|xi|)P(|x2| > 2^2 log n) 
+ {E6{xi) - E\xi\fP{\x2\ > 2^2 log n) 



< ESlixi) + ^x?P(|x2| > 2^2 log 7i). 
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Lemma 6 and equation (68) together yield that 



Case 2. ^21ogn < < 4-^2 log n. In this case, 



\B{0\ = \iE6{xi)P{\x2\ < 2V21ogn) + ^|xi|)P(|x2| > 2V21ogn) - |/i|| 

< \E6{xi) — \fj,\\ + |i?|a;i| — 

< \ESk{xi) -M+ E{{Sk{xi) - n)I{SK{xi) > n)} + 20(/x). 
Note that \ESk{xi) - < ^i^2K+i) ^^^"^ C^^) 

E{{Sk{xi) - n)I{SK{xi)>n)] <n-^/'^\og^ n. 
Note that < 0(V21ogn) < n^^. Hence, again the bias is bounded by 

|i?(e)i<^(i + o(i)). 

For the variance, equation (59) and Lemma 5 yield that 

Var(0 < Var(5x(xi)) + Var(rEi) + {E6{xi) - E\xi\f . 

Note that 

{E6{xi) - E\xi\f < [ESk{xi) - M+E{{SK{xi)-n)I{SK{xi)>n)} 



-2</.(/i) + 2|^|$(-|/i|)]2 



A/2 



Hence, it follows from Lemma 5 that 

Var(e) < ES\{xi)+\ai{xi) + {E5{xi) - E\xi\ 
<n^''^\og'n{l + o{l)). 



Case 3. > 4-^2 log n. In this case the standard bound for normal tail 
probability yields that 

P{\X2\ < 2^2 log n) < 2^-{\fi\ - 2v/21ogn)) < 2a>(^-^^ < ^<^(^) • 



In particular. 



P{\x2\ < 2v/21ogn) < 2^>(-2^21ogn) < 



2-v/7r logn 



n 



-4 
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Hence, 

1^(01 < \E\xi\ - \^i\\ + {\E6{xl)\+E\xl\)P{\x2\<2^2]E^) 



< 2(f>{n) + (n + + l)Pi\x2\ < 2V21og 



nj 



For the variance, equation (59) and Lemma 5 yield that 
Var(e) < Var(|xi|) + {Yav{6{xi)) + {E6{xi) - ^|xi|)^)P(|x2| < 2./2hgn) 



< 1 + (3n2 + 2(^2 + i))p(\x2\ < 2v/21ogn) = 1 + o(l). 
Putting the three cases together, we have the following. 

Proposition 1. For all /x G the bias and the variance of the estima- 
tor ^ defined in ( 57) satisfy 

(73) |5(0| < — (l + o(l)) and Var(e) < n^/^ log^ n(l + o(l)). 
ttK 

Proof of Theorem 6. With the detailed analysis of the one-dimensional 
case, we are now ready to give a short proof of Theorem 6. It suffices to 
focus on the estimator $\hat T(\theta')$ given in (43). Note that 



T{9') = -y^C{xa,Xi2 



n ^ — ' 

i=l 

where ^ is defined in (57). It follows from Proposition 1 that the bias 

B(T{6')) of the estimator T{9') is bounded by 

1 " i\/f 

\B{T{9'))\ < - ^ m{xa,Xi2))\ < ^(1 + o{l)), 

i=l 

and the variance of T{6') is bounded by 



Var(f(?0) < —Y.y^r{({xii,xi2)) < —Y.^'^^ ^og^ + o{l)) 



i=l 1=1 



<64n-^/2 logn(l + o(l)). 
Hence the mean squared error of T{6') satisfies 



E{T{e') - T{e')f < B\T{e')) + Var(r(0O) < 4i7o (1 + 0(1)) 

< -^(1 + 0(1)). 

logn 



□ 